Mike L. Smith
Senior Scientific Programmer
I am a research software engineer at the European Molecular Biology Laboratory.
Employed as part of the German Network for Bioinformatics Infrastructure (de.NBI), I develop and maintain a variety of tools for the analysis of biological data. In particular, I support many packages and tools for the Bioconductor project and its community, and am a member of both the Bioconductor core team and the Community Advisory Board.
I work closely with experimental scientists and other programmers to create robust, usable, and performant analysis tools. I enjoy using software to automate tasks in order to provide rapid deployment and feedback.
I am also passionate about good software practices for reproducible research. I work extensively with version control, containers, and literate programming tools like R Markdown, champion these practices to those around me, and have taught multiple courses on these topics.
Professional Experience
Senior Scientific Programmer
European Molecular Biology Laboratory
Heidelberg, DE
2019-
- Developed and deployed the Bioconductor Code Tools website. Using Docker containers and a Kubernetes deployment, this site automatically syncs with the central Bioconductor Git repository and provides tools for browsing and searching the code behind all Bioconductor packages.
- Designed and maintained a set of Bioconductor GitHub Actions (grimbough/bioc-actions) to simplify the use of GitHub Actions for Bioconductor package development.
- Ported existing HDF5 compression filters to R via the rhdf5filters package, benchmarked their performance on single-cell data, and researched the effectiveness of run-length encoding and bit-packing via new filters written in C.
- Created a continuous integration workflow using GitHub Actions for the Quarto book Modern Statistics for Modern Biology. This rapidly builds chapters in parallel, deploys a new edition of the book if successful, and alerts authors via email if any problems are encountered.
- Organised and hosted monthly Bioconductor Developer Forum and embl-R discussion sessions
- Advised and mentored junior software developers, both within EMBL and in the wider community, to help them improve their R skills and develop their own packages.
Bioinformatician
European Molecular Biology Laboratory
Heidelberg, DE
2015-19
- Maintained and continued development of several widely used Bioconductor packages, e.g. biomaRt and rhdf5, each with an extensive user base and thousands of downloads per month. This involved:
  - Modernising and strengthening the code base via code review and development of unit tests
  - Updating documentation and vignettes
  - Providing end-user support via email, online forums, and GitHub issues
- In collaboration with experimental biologists, developed software for the analysis of pooled CRISPR-based screens
- Developed workflows for analysis of bulk RNA-seq data, deployed on an HPC cluster
- Created BiocWorkflowTools for publishing R Markdown documents as both Bioconductor Workflows and publications
Research Associate
Cancer Research UK Cambridge Institute
Cambridge, UK
2013-15
- Wrote and deployed workflows for analysing structural variation data as part of the Oesophageal ICGC project
- Developed quality control software for Oxford Nanopore sequencing data
- Researched the impact data quality had on downstream analysis and results
Software Development Community Engagement
Bioconductor Community Advisory Board
Elected to a three-year term on the Community Advisory Board, which aims to engage the user and developer communities through training, outreach, and a welcoming environment. As part of this I have been involved with the following:
2020-
- New Developer Program - This program aims to encourage new developers to make the jump from scripting into package development by pairing them with more experienced mentors. As co-lead I have been responsible for designing the program, soliciting and reviewing applications from mentors and mentees, creating mentorship pairings, and checking on progress and satisfaction with the scheme.
- Package Review Working Group - Created to review and revise the process by which packages are accepted into Bioconductor, we have systematically updated the guidelines for package authors and the submission process. We have also successfully recruited a new cohort of reviewers to speed up the review process.
- Privacy Working Group - With its large community of users and many websites and services hosted by a variety of organisations around the globe, data privacy is a serious issue for Bioconductor. We are engaged in making sure that Bioconductor services meet both legal requirements and community expectations regarding personal data privacy.
embl-R Coding Club
Co-organiser and host of EMBL’s longest-running programming group. We hold bi-weekly tutorials, package demos, talks, and discussions on anything R related. I have personally taught sessions on package development, data wrangling, and parallel processing, among others, and arranged the program of speakers.
2020-21
Bioconductor Developers’ Forum
Organiser and host of the monthly developers’ forum, a series of presentations and workshops intended to bring the developer community closer together. This has included presentations by members of R Core, RStudio, rOpenSci, and Microsoft.
YouTube Playlist
2019-21
Education
University of Cambridge
PhD, Computational Biology,
Department of Oncology
Cambridge, UK
2009-12
Thesis: Low-level artefacts affecting microarrays and next-generation sequencing in a cancer genomics environment
Cardiff University
MSc (with Distinction) in Bioinformatics
Cardiff, UK
2007-08
Dissertation: The development of parallel processing techniques for the analysis of genome wide association studies
University of Bath
BSc (2.2) in Mathematics with Computing
Bath, UK
2003-07
Dissertation: A distributed computing approach to finding missing genes using protein threading
Teaching Experience
Advanced topics in single-cell transcriptomics
Working with on-disc data formats
Swiss Institute of Bioinformatics, Online
2020
BBSRC Advanced Methods for Reproducible Science Workshop
Introduction to R Markdown and literate programming for reproducible research
Windsor, UK
2018-20
EMBL Software Carpentry
Introduction to HPC with Slurm
Heidelberg, DE
2016-18, 2020
Statistical Data Analysis for Genome-Scale Biology (CSAMA)
A one-week intensive course teaching the analysis of multi-omics studies. At various times I have taught, provided online and in-person technical support, administered the course website and teaching materials, and reviewed applications from students.
Brixen, IT
2015-19, 2022
Prizes, Awards, and Grants
CZI Funding Call - Single-cell biology
Statistical Analysis and Comprehension of the Human Cell Atlas in R / Bioconductor: Access and Scalable Infrastructure - $45,000
2018
Applied in collaboration with Wolfgang Huber
RStudio Bookdown Contest
Runner-up. Awarded for msmbstyle, a Tufte-inspired markdown theme.
2018
UseR 2011 - Best Technical Poster Prize
2011
BioC Conference 2011 Travel and Accommodation Scholarship
2011
Publications
First Author
- Mike L. Smith, Andrzej K. Oleś, Wolfgang Huber. Authoring Bioconductor workflows with BiocWorkflowTools [version 1; referees: awaiting peer review]. F1000Research (2018)
- Smith ML, Baggerly KA, Bengtsson H, Ritchie ME, Hansen KD. illuminaio: An open source IDAT parsing tool for Illumina microarrays. F1000Research (2013)
- Smith ML, Dunning MJ, Tavaré S, Lynch AG. Identification and correction of previously unreported spatial phenomena using raw Illumina BeadArray data. BMC Bioinformatics (2010)
- Smith ML, Lynch AG. BeadDataPackR: A Tool to Facilitate the Sharing of Raw Data from Illumina BeadArray Studies. Cancer Informatics (2010)
Contributing Author
- Rozemarijn W. D. Kleinendorst, Guido Barzaghi, Mike L. Smith, Judith B. Zaugg, Arnaud R. Krebs. Genome-wide quantification of transcription factor binding at single-DNA-molecule resolution using methyl-transferase footprinting. Nature Protocols (2021)
- Alexandros P. Drainas, Ruxandra A Lambuta, Irina Ivanova, Özdemirhan Serçin, Ioannis Sarropoulos, Mike L. Smith, Theocharis Efthymiopoulos, Benjamin Raeder, Adrian M. Stütz, Sebastian M. Waszak, Balca R. Mardin, Jan O. Korbel. Genome-wide Screens Implicate Loss of Cullin Ring Ligase 3 in Persistent Proliferation and Genome Instability in TP53-Deficient Cells. Cell Reports (2020)
- Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pagès H, Smith ML, …, Hicks SC. Orchestrating single-cell analysis with Bioconductor. Nature Methods (2020)
- Aaron T. L. Lun, Hervé Pagès, Mike L. Smith. beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types. PLOS Computational Biology (2018)
- James H.R. Farmery, Mike L. Smith, Andy G. Lynch. Telomerecat: A ploidy-agnostic method for estimating telomere length from whole genome sequencing data. Scientific Reports (2017)
- Weaver JM, Ross-Innes CS, Shannon N, Lynch AG, Forshew T, Barbera M, Murtaza M, Ong CA, Lao-Sirieix P, Dunning MJ, Smith L, Smith ML, Anderson CL, Carvalho B, O’Donovan M, Underwood TJ, May AP, Grehan N, Hardwick R, OCCAMS Consortium. Ordering of mutations in preinvasive disease stages of esophageal carcinogenesis. Nature Genetics (2014)
- Ritchie ME, Dunning MJ, Smith ML, Shi W, Lynch AG. BeadArray expression analysis using Bioconductor. PLOS Computational Biology (2011)
- Cairns J, Spyrou C, Stark R, Smith ML, Lynch AG, Tavaré S. BayesPeak - an R package for analysing ChIP-seq data. Bioinformatics (2011)
- Moskvina V, Smith M, Ivanov D, Blackwood D, StClair D, Hultman C, Toncheva D, Gill M, Corvin A, O’Dushlaine C, Morris DW, Wray NR, Sullivan P, Pato C, Pato MT, Sklar P, Purcell S, Holmans P, O’Donovan MC, Owen MJ. Genetic differences between five European populations. Human Heredity (2010)
- Dunning MJ, Smith ML, Ritchie ME, Tavaré S. beadarray: R classes and methods for Illumina bead-based data. Bioinformatics (2007)
- Mark J. Dunning, Natalie P. Thorne, Isabelle Camilier, Michael L. Smith, Simon Tavaré. Quality Control and Low-Level Statistical Analysis of Illumina BeadArray. REVSTAT-Statistical Journal (2006)